high-dimensional logistic regression
The Impact of Regularization on High-dimensional Logistic Regression
Logistic regression is commonly used for modeling dichotomous outcomes. In the classical setting, where the number of observations is much larger than the number of parameters, properties of the maximum likelihood estimator in logistic regression are well understood. Recently, Sur and Candes~\cite{sur2018modern} have studied logistic regression in the high-dimensional regime, where the number of observations and parameters are comparable, and show, among other things, that the maximum likelihood estimator is biased. In the high-dimensional regime the underlying parameter vector is often structured (sparse, block-sparse, finite-alphabet, etc.) and so in this paper we study regularized logistic regression (RLR), where a convex regularizer that encourages the desired structure is added to the negative of the log-likelihood function. An advantage of RLR is that it allows parameter recovery even for instances where the (unconstrained) maximum likelihood estimate does not exist. We provide a precise analysis of the performance of RLR via the solution of a system of six nonlinear equations, through which any performance metric of interest (mean, mean-squared error, probability of support recovery, etc.) can be explicitly computed. Our results generalize those of Sur and Candes and we provide a detailed study for the cases of $\ell_2^2$-RLR and sparse ($\ell_1$-regularized) logistic regression. In both cases, we obtain explicit expressions for various performance metrics and can find the values of the regularizer parameter that optimizes the desired performance. The theory is validated by extensive numerical simulations across a range of parameter values and problem instances.
SLOE: A Faster Method for Statistical Inference in High-Dimensional Logistic Regression
Logistic regression remains one of the most widely used tools in applied statistics, machine learning and data science. However, in moderately high-dimensional problems, where the number of features $d$ is a non-negligible fraction of the sample size $n$, the logistic regression maximum likelihood estimator (MLE), and statistical procedures based the large-sample approximation of its distribution, behave poorly. Recently, Sur and Candès (2019) showed that these issues can be corrected by applying a new approximation of the MLE's sampling distribution in this high-dimensional regime. Unfortunately, these corrections are difficult to implement in practice, because they require an estimate of the \emph{signal strength}, which is a function of the underlying parameters $\beta$ of the logistic regression. To address this issue, we propose SLOE, a fast and straightforward approach to estimate the signal strength in logistic regression. The key insight of SLOE is that the Sur and Candès (2019) correction can be reparameterized in terms of the corrupted signal strength, which is only a function of the estimated parameters $\widehat \beta$. We propose an estimator for this quantity, prove that it is consistent in the relevant high-dimensional regime, and show that dimensionality correction using SLOE is accurate in finite samples. Compared to the existing ProbeFrontier heuristic, SLOE is conceptually simpler and orders of magnitude faster, making it suitable for routine use. We demonstrate the importance of routine dimensionality correction in the Heart Disease dataset from the UCI repository, and a genomics application using data from the UK Biobank.
Variational empirical Bayes variable selection in high-dimensional logistic regression
Logistic regression involving high-dimensional covariates is a practically important problem. Often the goal is variable selection, i.e., determining which few of the many covariates are associated with the binary response. Unfortunately, the usual Bayesian computations can be quite challenging and expensive. Here we start with a recently proposed empirical Bayes solution, with strong theoretical convergence properties, and develop a novel and computationally efficient variational approximation thereof. One such novelty is that we develop this approximation directly for the marginal distribution on the model space, rather than on the regression coefficients themselves. We demonstrate the method's strong performance in simulations, and prove that our variational approximation inherits the strong selection consistency property satisfied by the posterior distribution that it is approximating.
- Asia > Middle East > Jordan (0.04)
- Asia > Bangladesh > Dhaka Division > Dhaka District > Dhaka (0.04)
- North America > United States > North Carolina (0.04)
- Research Report > New Finding (0.72)
- Research Report > Experimental Study (0.72)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.88)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.68)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.66)
Reviews: The Impact of Regularization on High-dimensional Logistic Regression
Originality: This paper develops asymptotics theory for high-dimensional regularized logistic regression (LR). The main result of the paper (Theorem 1) is proved for any locally-Lipschitz function \Psi which then in special cases provides asymptotics for common descriptive statistics like correlation, variance, mean-squared error. Special case results for L1 and L2 regularized LR are also derived and quantities highlighted in 1 above are derived. The paper also demonstrates that the numerical simulation results align with the theoretical relations. Quality: The paper contains high quality results and proofs, the notation and setup is well defined in section 2 before the main results.
- Research Report > New Finding (0.63)
- Research Report > Experimental Study (0.63)
Reviews: The Impact of Regularization on High-dimensional Logistic Regression
The authors study the limiting distribution of certain functionals of the penalized maximum likelihood estimator in regression. The paper contains nontrivial new extensions of the work of Sur and Candes in the unpenalized case, and is well-written and interesting. The reviews were mostly positive and the paper is in good shape.
- Research Report > New Finding (0.40)
- Research Report > Experimental Study (0.40)
SLOE: A Faster Method for Statistical Inference in High-Dimensional Logistic Regression
Logistic regression remains one of the most widely used tools in applied statistics, machine learning and data science. However, in moderately high-dimensional problems, where the number of features d is a non-negligible fraction of the sample size n, the logistic regression maximum likelihood estimator (MLE), and statistical procedures based the large-sample approximation of its distribution, behave poorly. Recently, Sur and Candès (2019) showed that these issues can be corrected by applying a new approximation of the MLE's sampling distribution in this high-dimensional regime. Unfortunately, these corrections are difficult to implement in practice, because they require an estimate of the \emph{signal strength}, which is a function of the underlying parameters \beta of the logistic regression. To address this issue, we propose SLOE, a fast and straightforward approach to estimate the signal strength in logistic regression.
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
The Impact of Regularization on High-dimensional Logistic Regression
Logistic regression is commonly used for modeling dichotomous outcomes. In the classical setting, where the number of observations is much larger than the number of parameters, properties of the maximum likelihood estimator in logistic regression are well understood. Recently, Sur and Candes \cite{sur2018modern} have studied logistic regression in the high-dimensional regime, where the number of observations and parameters are comparable, and show, among other things, that the maximum likelihood estimator is biased. In the high-dimensional regime the underlying parameter vector is often structured (sparse, block-sparse, finite-alphabet, etc.) and so in this paper we study regularized logistic regression (RLR), where a convex regularizer that encourages the desired structure is added to the negative of the log-likelihood function. An advantage of RLR is that it allows parameter recovery even for instances where the (unconstrained) maximum likelihood estimate does not exist.
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
The Impact of Regularization on High-dimensional Logistic Regression
Salehi, Fariborz, Abbasi, Ehsan, Hassibi, Babak
Logistic regression is commonly used for modeling dichotomous outcomes. In the classical setting, where the number of observations is much larger than the number of parameters, properties of the maximum likelihood estimator in logistic regression are well understood. Recently, Sur and Candes \cite{sur2018modern} have studied logistic regression in the high-dimensional regime, where the number of observations and parameters are comparable, and show, among other things, that the maximum likelihood estimator is biased. In the high-dimensional regime the underlying parameter vector is often structured (sparse, block-sparse, finite-alphabet, etc.) and so in this paper we study regularized logistic regression (RLR), where a convex regularizer that encourages the desired structure is added to the negative of the log-likelihood function. An advantage of RLR is that it allows parameter recovery even for instances where the (unconstrained) maximum likelihood estimate does not exist.
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
The phase transition for the existence of the maximum likelihood estimate in high-dimensional logistic regression
Candes, Emmanuel J., Sur, Pragya
This paper rigorously establishes that the existence of the maximum likelihood estimate (MLE) in high-dimensional logistic regression models with Gaussian covariates undergoes a sharp `phase transition'. We introduce an explicit boundary curve $h_{\text{MLE}}$, parameterized by two scalars measuring the overall magnitude of the unknown sequence of regression coefficients, with the following property: in the limit of large sample sizes $n$ and number of features $p$ proportioned in such a way that $p/n \rightarrow \kappa$, we show that if the problem is sufficiently high dimensional in the sense that $\kappa > h_{\text{MLE}}$, then the MLE does not exist with probability one. Conversely, if $\kappa < h_{\text{MLE}}$, the MLE asymptotically exists with probability one.
- North America > United States > California > Santa Clara County > Stanford (0.04)
- Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
- Research Report > New Finding (0.49)
- Research Report > Experimental Study (0.35)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.71)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.71)